John Snow, Map of Cholera deaths in relation to water Pumps, London, 1854
October 30, 2018
John Snow, Map of Cholera deaths in relation to water Pumps, London, 1854
R can be a powerful and accesible way to do geospatial analyses. In general, there are two type of spatial data.
Vector data can be points (e.g. household locations in a survey), lines (e.g. roads or rivers), or polygons (e.g. country level administrative boundaries). These primarily come in the form of ‘shapefiles’.
An area gridded into cells at some spatial resolution each contatining a value (can be categorical or continuous):
For today, we will use administrative shapefiles (polygons) at the district level for the two Indian States for which we have data for and raster data from World Pop.
We will learn how to: 1. Read in and write out vector + raster data 2. Extract raster data to polygons 3. Aggregate data spatially and map
These are the libraries we’ll need to install for the tutorial. There are many different packages to work with geospatial data in R, but the ones we’ll use today are rgdal, sf, and raster.
# install these if not already done
# install.packages(c("rgdal", "sf", "raster", "tidyverse", "magrittr"))
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-6, (SVN revision 841) ## Geospatial Data Abstraction Library extensions to R successfully loaded ## Loaded GDAL runtime: GDAL 2.4.2, released 2019/06/28 ## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/gdal ## GDAL binary built with GEOS: FALSE ## Loaded PROJ.4 runtime: Rel. 5.2.0, September 15th, 2018, [PJ_VERSION: 520] ## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/proj ## Linking to sp version: 1.3-1
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(raster) library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2 ## ✔ tibble 2.1.3 ✔ dplyr 0.8.3 ## ✔ tidyr 1.0.0 ✔ stringr 1.4.0 ## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ tidyr::extract() masks raster::extract() ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ✖ dplyr::select() masks raster::select()
library(magrittr)
## ## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr': ## ## set_names
## The following object is masked from 'package:tidyr': ## ## extract
## The following object is masked from 'package:raster': ## ## extract
library(lme4)
## Loading required package: Matrix
## ## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr': ## ## expand, pack, unpack
library(patchwork)
## ## Attaching package: 'patchwork'
## The following object is masked from 'package:raster': ## ## area
To read in shapefiles, we will use the readOGR function in the rgdal package. We will read in the administrative boundary 2 (district level) shapefile for the country of India (we’ll call it ‘india_shp’ for now).
One thing you might notice is that there are many different files with different extensions in the folder data/India where the national shapefile is stored.
These files must be stored together and are necessary to read in the shapefile, but the one you select is the file with the extension ‘.shp’.
india_shp <- readOGR("data/India/India.shp")
## OGR data source with driver: ESRI Shapefile ## Source: "/Users/mrajeev/Documents/Projects/mapping-chennai/data/India/India.shp", layer: "India" ## with 666 features ## It has 16 fields
This is a really big file and as we are only working with data from two states, Assam and Punjab, we can subset these spatial data, just as you would a regular data frame. If you look at the data frame (try head and names), then you will see that the field for the state level is name_1. We want to subset this to Punjab and Assam.
punjab <- subset(india_shp, name_1 == "Punjab") plot(punjab, main = "Punjab State")
Now try the same for Assam:
Now try the same for Assam:
assam <- subset(india_shp, name_1 == "Assam") plot(assam, main = "Assam State")
If you want to save shapefiles you’ve made, you can write out the shapefile using the writeOGR function. You need to specify the object you want to write, the name of the folder where the files will go, the name of the file, and the driver. When writing shapefiles, in general you can use ‘ESRI Shapefile’ as the driver.
writeOGR(punjab, "output/Punjab", "Punjab", driver = "ESRI Shapefile", overwrite = TRUE)
The other spatial data we will work with today is high resolution estimates of human populations from the World Pop Project (you can read more about how these data are created and what else is available at http://www.worldpop.org.uk). This data gives you the number of people per 100 x 100m grid cell, but I’ve aggregated it up to 10 x 10 km km for today so that it’s easier to work with (checkout the raster::aggregate function if you want to try it yourself).
We will use the raster function to read in the raster file:
pop <- raster("data/india_pop.tif")
names(pop) <- "pop"
plot(pop)
If you want to write out the raster data:
writeRaster(pop, "output/pop.tif")
We can extract this data to our admin shapefiles for Assam and Punjab to get the estimated human population per district for each state. We do that using the crop, mask, and extract functions from the raster package. Note of caution: this can be a very slow process with large shapefiles or high resolution rasters!
assam_pop <- crop(pop, assam) # First we crop our pop raster to Assam assam_pop <- mask(assam_pop, assam) # Then we have to mask this so that all values outside of Assam are set to NA
Try it for Punjab now:
Try it for Punjab now:
# Try it for Punjab now: punjab_pop <- crop(pop, punjab) punjab_pop <- mask(punjab_pop, punjab)
We can plot to make sure it worked:
par(mfrow = c(1, 2), mar = c(2, 2, 2, 2)) plot(punjab_pop, main = "Human pop Punjab") plot(punjab, add = TRUE) plot(assam_pop, main = "Human pop Assam") plot(assam, add = TRUE)
Now we will extract these data, which basically means we will take all the grid cells that fall within a polygon and calculate a summary statistic. In our case we want to get the sum of values (i.e. the number of people living in each district).
assam <- extract(assam_pop, assam, fun = sum, # we take the sum of the grid cells in each district
na.rm = TRUE, # grid cells with NA pops are ignored
sp = TRUE) # gives us back a spatial polygons data frame
Try it for Punjab now:
Try it for Punjab now:
punjab <- extract(punjab_pop, punjab, fun = sum, na.rm = TRUE, sp = TRUE)
We don’t actually have spatial data available, so I’ve assigned a district randomly to each individual in the survey. We can now aggregate this data to the district level. A new package called sf facilitates using the tidyverse (i.e. dplyr + ggplot2) for spatial data. For the rest of the tutorial we’ll use this package to make data manipulation and plotting easier.
First we need to convert our spatial data frames to ‘sf’ objects:
# convert using st_as sf function punjab_sf <- st_as_sf(punjab)
## Warning in fun(libname, pkgname): rgeos: versions of GEOS runtime 3.8.0-CAPI-1.13.1 ## and GEOS at installation 3.7.2-CAPI-1.11.2differ
assam_sf <- st_as_sf(assam)
sf (simple features) objectsCheck out the sf objects:
class(punjab_sf)
## [1] "sf" "data.frame"
sf (simple features) objectsCheck out the sf objects:
head(punjab_sf)
## Simple feature collection with 6 features and 17 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: 73.87089 ymin: 29.77701 xmax: 76.58119 ymax: 32.04543 ## epsg (SRID): 4326 ## proj4string: +proj=longlat +datum=WGS84 +no_defs ## iso admn_lv name_0 id_0 type_0 name_1 id_1 type_1 name_2 ## 465 IND 2 India 10000881 Country Punjab 10315044 State Amritsar ## 466 IND 2 India 10000881 Country Punjab 10315044 State Faridkot ## 467 IND 2 India 10000881 Country Punjab 10315044 State Barnala ## 471 IND 2 India 10000881 Country Punjab 10315044 State Bathinda ## 474 IND 2 India 10000881 Country Punjab 10315044 State Fatehgarh Sahib ## 476 IND 2 India 10000881 Country Punjab 10315044 State Fazilka ## id_2 type_2 name_3 id_3 type_3 source cntry_l pop ## 465 10019070 District NA NA NA GADM v2.8 IND_2 2459497.1 ## 466 10019073 District NA NA NA GADM v2.8 IND_2 669828.9 ## 467 10019071 District NA NA NA GADM v2.8 IND_2 699785.0 ## 471 10019072 District NA NA NA GADM v2.8 IND_2 1500438.6 ## 474 10019074 District NA NA NA GADM v2.8 IND_2 652263.9 ## 476 10019075 District NA NA NA GADM v2.8 IND_2 1092486.0 ## geometry ## 465 MULTIPOLYGON (((74.92346 32... ## 466 MULTIPOLYGON (((74.89951 30... ## 467 MULTIPOLYGON (((75.71488 30... ## 471 MULTIPOLYGON (((75.17096 30... ## 474 MULTIPOLYGON (((76.52028 30... ## 476 MULTIPOLYGON (((74.2091 30....
They are data frames with geometries stored as a single column in a list.
sf objectsIf you use the plot function on an sf object, you need to specify which variable to plot by:
# plot using sf notation--lets just do the pop plot(punjab_sf["pop"]) # see the help on plot_sf for more options on customization
If you don’t specify, you end up trying to plot all the attributes, which can be slow and messy!
sf objectsYou can also use ggplot (I prefer this):
# you can also use ggplot (ggplot(assam_sf) + geom_sf(aes(fill = pop))) + (ggplot(punjab_sf) + geom_sf(aes(fill = pop)))
We can read in our serodata from earlier and we will change the qualitative results to 0 for positive and 1 for negative to make summarizing easier:
df.eia <- read_csv("data/serodata_spatial.csv")
## Parsed with column specification: ## cols( ## SITEID = col_character(), ## AGE = col_double(), ## REVQUAL_M = col_character(), ## REVQUAL_R = col_character(), ## district = col_character() ## )
df.eia %>% # change so that results are 0/1 to more easily aggregate
mutate(REVQUAL_M = ifelse(REVQUAL_M =="Positive", 1, 0),
REVQUAL_R = ifelse(REVQUAL_R =="Positive", 1, 0)) -> df.eia
We will now calculate summary statistics for each variable
df.eia %>%
filter(SITEID == "Assam") %>% # filter for Assam
group_by(district) %>% # group by district
summarize(sample_size= n(), pos_measles = sum(REVQUAL_M),
pos_rub = sum(REVQUAL_R),
avg_age_sampled = mean(AGE),
avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
right_join(assam_sf, by = c("district" = "name_2")) %>%
st_as_sf() -> assam_dist
## Warning: Column `district`/`name_2` joining character vector and factor, ## coercing into character vector
Now try for Punjab:
Now try for Punjab:
# Do the same for punjab to get the same summarized vars for each district
df.eia %>%
filter(SITEID == "Punjab") %>% # filter for Assam
group_by(district) %>% # group by district
summarize(sample_size= n(), pos_measles = sum(REVQUAL_M),
pos_rub = sum(REVQUAL_R),
avg_age_sampled = mean(AGE),
avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
right_join(punjab_sf, by = c("district" = "name_2")) %>%
st_as_sf() -> punjab_dist
## Warning: Column `district`/`name_2` joining character vector and factor, ## coercing into character vector
Now we can visualize our serodata spatially. Lets map the proportion seropositive for measles by district for each state:
# Map the average age of sampled + infection
assam_dist %>%
ggplot() + geom_sf(aes(fill = pos_measles/sample_size)) +
scale_fill_continuous(name = "Proportion \n seropositive: \n measles",
type = "viridis")
punjab_dist %>%
ggplot() + geom_sf(aes(fill = pos_measles/sample_size)) +
scale_fill_continuous(name = "Proportion \n seropositive: \n measles",
type = "viridis")
# Map the average age of sampled + infection for rubella for the state of assam assam_dist %>% ggplot() + geom_sf(aes(fill = avg_age_sampled)) + scale_fill_continuous(name = "Average \n age sampled", type = "viridis")
While we don’t have true spatially resolved data at the district level, we will use the data that we generated to practice what we learned earlier today. Note: this is just to learn how to work with data rather than a question driven modeling approach (which is always preferrable;)
In the data folder is a raster dataset of births at a 10 x 10 km km grid for India taken from World Pop. Extract this data to the district level for Assam and Punjab (you will need to use the original spatial polygon data frames, i.e. punjab not punjab_sf)
births <- raster("data/india_births.tif")
names(births) <- "births"
punjab_births <- crop(births, punjab)
punjab_births <- mask(punjab_births, punjab)
punjab <- extract(punjab_births, punjab, fun = sum, na.rm = TRUE, sp = TRUE)
assam_births <- crop(births, assam)
assam_births <- mask(assam_births, assam)
assam <- extract(assam_births, assam, fun = sum, na.rm = TRUE, sp = TRUE)
Explore the relationship between seropositivity (the response variable) at the district level and the per capita birthrate (the predictor variable, calculate as the number of births/population) for rubella for Punjab.
Hint: you will want to use family = binomial, but this time your data will follow the following formula: glm(cbind(successes, failures) ~ predictor, family = binomial) where the successes = # of seropositives and failures = # of seronegatives.
punjab_sf <- st_as_sf(punjab)
df.eia %>%
filter(SITEID == "Punjab") %>% # filter for Punjab
group_by(district) %>% # group by district
summarize(sample_size= n(), pos_measles = sum(REVQUAL_M),
pos_rub = sum(REVQUAL_R),
avg_age_sampled = mean(AGE),
avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
right_join(punjab_sf, by = c("district" = "name_2")) -> punjab_dist
## Warning: Column `district`/`name_2` joining character vector and factor, ## coercing into character vector
punjab_dist %$% glm(cbind(pos_rub, sample_size - pos_rub) ~ births/pop, family = binomial)
## ## Call: glm(formula = cbind(pos_rub, sample_size - pos_rub) ~ births/pop, ## family = binomial) ## ## Coefficients: ## (Intercept) births births:pop ## 2.615e-01 -1.932e-05 2.896e-12 ## ## Degrees of Freedom: 21 Total (i.e. Null); 19 Residual ## Null Deviance: 33.93 ## Residual Deviance: 32.15 AIC: 119.1
Combine both Punjab and Assam datasets and add in state as a random effect!**
Hint: you will need to use the glmer function from the lme4 package.
library(lme4)
assam_sf <- st_as_sf(assam)
df.eia %>%
filter(SITEID == "Assam") %>% # filter for assam
group_by(district) %>% # group by district
summarize(sample_size= n(), pos_measles = sum(REVQUAL_M),
pos_rub = sum(REVQUAL_R),
avg_age_sampled = mean(AGE),
avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
right_join(assam_sf, by = c("district" = "name_2")) -> assam_dist
## Warning: Column `district`/`name_2` joining character vector and factor, ## coercing into character vector
df_combined <- bind_rows(assam_dist, punjab_dist)
## Warning in bind_rows_(x, .id): Vectorizing 'sfc_MULTIPOLYGON' elements may not ## preserve their attributes
## Warning in bind_rows_(x, .id): Vectorizing 'sfc_MULTIPOLYGON' elements may not ## preserve their attributes
df_combined %>% mutate(birth_rate = births/pop) -> df_combined glmer(cbind(pos_rub, sample_size - pos_rub) ~ birth_rate + (1 | name_1), family = binomial, data = df_combined)
## boundary (singular) fit: see ?isSingular
## Generalized linear mixed model fit by maximum likelihood (Laplace ## Approximation) [glmerMod] ## Family: binomial ( logit ) ## Formula: cbind(pos_rub, sample_size - pos_rub) ~ birth_rate + (1 | name_1) ## Data: df_combined ## AIC BIC logLik deviance df.resid ## 244.4491 250.1246 -119.2246 238.4491 46 ## Random effects: ## Groups Name Std.Dev. ## name_1 (Intercept) 0 ## Number of obs: 49, groups: name_1, 2 ## Fixed Effects: ## (Intercept) birth_rate ## -0.7297 45.3317 ## convergence code 0; 1 optimizer warnings; 0 lme4 warnings
I didn’t actually assign individuals randomly to districts! Can you figure out which variable I used to allocate sampled individuals (hint: all the data you have is the data I have:)?
df_combined %$% plot(pop, sample_size, bty = "l", xlab = "District pop", ylab = "Number sampled")
punjab$samples <- as.vector(rmultinom(n = 1, size = nrow(df.eia %>% filter(SITEID == "Punjab")),
prob = punjab$pop))
assam$samples <- as.vector(rmultinom(n = 1, size = nrow(df.eia %>% filter(SITEID == "Assam")),
prob = assam$pop))
alloc_punjab <- rep(punjab$name_2, punjab$samples)
alloc_assam <- rep(assam$name_2, assam$samples)
# The jumble function!
jumble <- function(x) {
jumbled <- x[order(runif(length(x), min = 0, max = 1))]
return(jumbled)
}
df.eia %>%
arrange(SITEID) %>% # in ascending order with Assam 1st!
mutate(district = c(as.character(jumble(alloc_assam)),
as.character(jumble(alloc_punjab)))) -> df.eia
https://geocompr.robinlovelace.net
https://map.ox.ac.uk (Mainly malaria data, but also accessibility, and a nice R package which includes a way to access admin shapefiles by country)
sessionInfo()
## R version 3.6.1 (2019-07-05) ## Platform: x86_64-apple-darwin15.6.0 (64-bit) ## Running under: macOS Mojave 10.14.6 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] patchwork_1.0.0.9000 lme4_1.1-21 Matrix_1.2-17 ## [4] magrittr_1.5 forcats_0.4.0 stringr_1.4.0 ## [7] dplyr_0.8.3 purrr_0.3.2 readr_1.3.1 ## [10] tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1 ## [13] tidyverse_1.2.1 raster_3.0-7 sf_0.8-0 ## [16] rgdal_1.4-6 sp_1.3-1 knitr_1.27 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_1.0.3 lubridate_1.7.4 lattice_0.20-38 class_7.3-15 ## [5] assertthat_0.2.1 zeallot_0.1.0 digest_0.6.23 R6_2.4.1 ## [9] cellranger_1.1.0 backports_1.1.5 evaluate_0.14 e1071_1.7-2 ## [13] httr_1.4.1 pillar_1.4.2 rlang_0.4.4 lazyeval_0.2.2 ## [17] readxl_1.3.1 minqa_1.2.4 rstudioapi_0.10 nloptr_1.2.1 ## [21] rmarkdown_2.1 labeling_0.3 splines_3.6.1 munsell_0.5.0 ## [25] broom_0.5.2 compiler_3.6.1 modelr_0.1.5 xfun_0.12 ## [29] pkgconfig_2.0.3 rgeos_0.5-2 htmltools_0.4.0 tidyselect_0.2.5 ## [33] codetools_0.2-16 viridisLite_0.3.0 crayon_1.3.4 withr_2.1.2 ## [37] MASS_7.3-51.4 grid_3.6.1 nlme_3.1-141 jsonlite_1.6 ## [41] gtable_0.3.0 lifecycle_0.1.0 DBI_1.0.0 units_0.6-4 ## [45] scales_1.0.0 KernSmooth_2.23-15 cli_1.1.0 stringi_1.4.5 ## [49] xml2_1.2.2 vctrs_0.2.0 generics_0.0.2 boot_1.3-23 ## [53] tools_3.6.1 glue_1.3.1 hms_0.5.1 parallel_3.6.1 ## [57] yaml_2.2.0 colorspace_1.4-1 classInt_0.4-1 rvest_0.3.4 ## [61] haven_2.1.1